1 BUSCOMP Run Summary

BUSCOMP V0.7.1: run Thu May  9 10:13:22 2019

See the run details appendix at the end of this document for details of the log file, commandline parameters and runtime BUSCOMP errors and warnings.

NOTE: To edit this document, open yeast.N3L20ID0U.full.Rmd in RStudio, edit and re-knit to HTML.

1.1 BUSCOMP Results Summary

Assemblies can be assessed on a number of criteria, but the main ones (in the absence of a reference “truth” genome) are either to judge contiguity or completeness. NG50 and LG50 values are based on a genome size of 13.1 Mb. If the genomesize=X parameter was not set (see command list in appendix), this will be based on the longest assembly (see sequence stats, below).

Of the 4 assemblies analysed (4 BUSCO; 4 fasta; 4 both), 3 genomes were rated as the “best” by at least one criterion:

  • PacBioHQ: NG50Length, LG50Count, MaxLength, Complete, Missing, BUSCO.
  • PacBioWTDBG2: Complete, Missing.
  • SGD: LG50Count, Complete, Missing, NoBUSCO.

Best assemblies by assembly contiguity criteria:

  • NG50Length. Longest NG50 contig/scaffold length (930,848 bp): PacBioHQ
  • LG50Count. Smallest LG50 contig/scaffold count (6): PacBioHQ, SGD
  • MaxLength. Maximum contig/scaffold length (1,553,502 bp): PacBioHQ

Best assemblies by completeness criteria:

  • Complete. Most Complete (Single & Duplicated) BUSCOMP sequences (99.9 %): PacBioHQ, PacBioWTDBG2, SGD
  • Missing. Fewest Missing BUSCOMP sequences (0.0 %): PacBioHQ, PacBioWTDBG2, SGD
  • BUSCO. Most Complete (Single & Duplicated) BUSCO sequences (97.8 %): PacBioHQ
  • NoBUSCO. Fewest Missing BUSCO sequences (0.9 %): SGD

2 Genome Summary

The following genomes and BUSCO results were analysed by BUSCOMP:

  • SGD. [BUSCO|Fasta] SGD R64.2.1 reference genome (strain S288c)
  • PacBioHQ. [BUSCO|Fasta] High quality PacBio assembly of strain MBG344 (similar to S288c)
  • chrIIIdup. [BUSCO|Fasta] Duplicated chromosome III contig from High Quality PacBio assembly
  • PacBioWTDBG2. [BUSCO|Fasta] WTDBG2 PacBio assembly of strain MBG344 (similar to S288c)

Details of the directories and files are below:

Directory Prefix Genome Fasta Sequences
../busco3/run_SGDR64.2.1 SGDR64.2.1 SGD ../fasta/SGDR64.2.1.fsa True
../busco3/run_MBG344001 MBG344001 PacBioHQ ../fasta/MBG344001.fsa True
../busco3/run_chrIIIdup chrIIIdup chrIIIdup ../fasta/chrIIIdup.fsa True
../busco3/run_MBG344WTDBG2 MBG344WTDBG2 PacBioWTDBG2 ../fasta/MBG344WTDBG2.fsa True

Genomes with a Directory listed had BUSCO results available. If Sequences is True, these will have been compiled to generate the BUSCOMP sequence set (unless buscompseq=F, or alternative sequences were provided with buscofas=FASFILE). Genomes with a Fasta listed had sequence data available for BUSCOMP searches.

2.1 Genome statistics

The following genome statistics were also calculated by RJE_SeqList for each genome (table, below):

  • SeqNum: The total number of scaffolds/contigs in the assembly.
  • TotLength: The total combined length of scaffolds/contigs in the assembly.
  • MinLength: The length of the shortest scaffold/contig in the assembly.
  • MaxLength: The length of the longest scaffold/contig in the assembly.
  • MeanLength: The mean length of scaffolds/contigs in the assembly.
  • MedLength: The median length of scaffolds/contigs in the assembly.
  • N50Length: At least half of the assembly is contained on scaffolds/contigs of this length or greater.
  • L50Count: The smallest number of scaffolds/contigs needed to cover half the assembly.
  • NG50Length: At least half of the genome is contained on scaffolds/contigs of this length or greater. This is based on genomesize=X. If no genome size is given, it will be relative to the biggest assembly.
  • LG50Count: The smallest number of scaffolds/contigs needed to cover half the genome. This is based on genomesize=X. If no genome size is given, it will be relative to the biggest assembly.
  • GapLength: The total number of undefined “gap” (N) nucleotides in the assembly.
  • GC: The %GC content of the assembly.

Genome SeqNum TotLength MinLength MaxLength MeanLength MedLength N50Length L50Count NG50Length LG50Count
SGD 18 12163423 6318 1531933 675745.7 706283.5 924431 6 924431 6
PacBioHQ 17 12275942 85779 1553502 722114.2 746591.0 930848 6 930848 6
chrIIIdup 2 695514 347757 347757 347757.0 347757.0 347757 1 0 -1
PacBioWTDBG2 29 12109398 4007 1524521 417565.4 344232.0 764884 6 755429 7

NOTE: NG50Length and LG50Count statistics use genomesize=X or the biggest assembly loaded (13.10 Mb). If BUSCOMP has been run more than once on the same data (e.g. to update descriptions or sorting), please make sure that a consistent genome size is used, or these values may be wrong. If in doubt, run with force=T and force regeneration of statistics.
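
The N50/L50 and NG50/LG50 definitions above can be illustrated with a minimal Python sketch. This is not the RJE_SeqList implementation; the function name `nx_stats` and the toy lengths are hypothetical. The (0, -1) return for an assembly that cannot cover half the genome is an assumption that mirrors the chrIIIdup row in the table.

```python
def nx_stats(lengths, target_size=None, fraction=0.5):
    """Return (length, count) for N50/L50-style statistics.

    target_size=None measures against the assembly itself (N50/L50);
    passing a genome size measures against the genome (NG50/LG50).
    """
    total = target_size if target_size is not None else sum(lengths)
    threshold = total * fraction
    running = 0
    # Walk from the longest sequence down until the threshold is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    # Assembly is too small to cover the requested fraction.
    # Assumption: (0, -1) is a sentinel, as seen for chrIIIdup in the table.
    return 0, -1


toy = [500, 400, 300, 200, 100]          # hypothetical lengths, total 1500
print(nx_stats(toy))                     # N50/L50 -> (400, 2)
print(nx_stats(toy, target_size=2000))   # NG50/LG50 -> (300, 3)
print(nx_stats(toy, target_size=4000))   # too small -> (0, -1)
```

Applied to the two 347,757 bp chrIIIdup contigs, the sketch gives an N50/L50 of (347757, 1) and, against the 13.1 Mb genome size, an NG50/LG50 of (0, -1), in line with the table.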

2.2 Genome coverage assessment plots

In general, a good assembly will be approx. the same size as the genome and in as few pieces as possible. Any assembly smaller than the predicted genome size is clearly missing coverage. Assemblies bigger than the genome size might still be missing chunks of the genome if redundancy/duplication is a problem. In the following plot, the grey line marks the given genome size of 13.1 Mb.

A better indicator of the overall coverage of the genome is the number of Missing BUSCO genes. As BUSCO is highly dependent on the accuracy of the sequence and the gene models it makes, the Missing BUSCOMP ratings arguably give a more consistent proxy for genome completeness. NOTE: this says nothing about the fragmentation or completeness of the genes themselves.

2.3 Genome contiguity assessment plots

In general, a good assembly will be in fewer, bigger pieces. This is approximated using NG50 and LG50, which are the min. length and number of contigs/scaffolds required to cover at least half the genome. These stats use the given genome size of 13.1 Mb.

NOTE: To modify these plots and tables, edit the *.genomes.tdt and *.NxLxxIDxx.rdata.tdt files and re-knit the *.NxLxxIDxx.Rmd file.

3 BUSCO Ratings

Compiled BUSCO results for 4 assemblies and 2 groups have been saved in yeast.genomes.tdt. BUSCO ratings are defined (quoting from the BUSCO v3 User Guide) as:

  • Complete: Single-copy hits where “BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile.”
  • Duplicated: As Complete but 2+ copies.
  • Fragmented: “BUSCO matches … within the range of scores but not within the range of length alignments to the BUSCO profile.”
  • Missing: “Either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile.”

Genome N Complete Single Duplicated Fragmented Missing
SGD 1711 1683 1672 11 12 16
PacBioHQ 1711 1684 1673 11 10 17
chrIIIdup 1711 45 0 45 0 1666
HighQuality 1711 1684 1673 11 11 16
PacBioWTDBG2 1711 1384 1375 9 135 192
BUSCOMP 1711 1689 1681 8 9 13

3.1 Genome Groups

BUSCOMP compiled the following groups of genomes (where BUSCO data was loaded), keeping the “best” rating for each BUSCO gene across the group:

  • HighQuality: SGD PacBioHQ chrIIIdup
  • BUSCOMP: SGD PacBioHQ chrIIIdup PacBioWTDBG2
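
As a rough illustration of keeping the "best" rating per gene across a group, the sketch below ranks ratings and takes the best-ranked one among the group members. The ranking order (Complete, then Duplicated, Fragmented, Missing) is an assumption, as are the function names and the toy gene identifiers; this is not BUSCOMP's own code.

```python
# Assumed preference order: lower rank is "better".
RANK = {"Complete": 0, "Duplicated": 1, "Fragmented": 2, "Missing": 3}


def best_rating(ratings):
    """Return the best-ranked rating from a list of BUSCO ratings."""
    return min(ratings, key=RANK.__getitem__)


def compile_group(per_genome, members):
    """Combine per-genome {gene: rating} dicts into one group-level dict.

    Genes absent from a member are treated as Missing (an assumption).
    """
    genes = set().union(*(per_genome[g] for g in members))
    return {
        gene: best_rating([per_genome[g].get(gene, "Missing") for g in members])
        for gene in sorted(genes)
    }


# Hypothetical ratings for two genes in two assemblies.
per_genome = {
    "A": {"gene1": "Missing", "gene2": "Fragmented"},
    "B": {"gene1": "Duplicated", "gene2": "Missing"},
}
print(compile_group(per_genome, ["A", "B"]))
# {'gene1': 'Duplicated', 'gene2': 'Fragmented'}
```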

3.2 BUSCO Summary

SGD BUSCO Results:
        C:98.4%[S:97.7%,D:0.6%],F:0.7%,M:0.9%,n:1711

PacBioHQ BUSCO Results:
        C:98.4%[S:97.8%,D:0.6%],F:0.6%,M:1.0%,n:1711

chrIIIdup BUSCO Results:
        C:2.6%[S:0.0%,D:2.6%],F:0.0%,M:97.4%,n:1711

HighQuality BUSCO Results:
        C:98.4%[S:97.8%,D:0.6%],F:0.6%,M:0.9%,n:1711

PacBioWTDBG2 BUSCO Results:
        C:80.9%[S:80.4%,D:0.5%],F:7.9%,M:11.2%,n:1711

BUSCOMP BUSCO Results:
        C:98.7%[S:98.2%,D:0.5%],F:0.5%,M:0.8%,n:1711
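
The one-line summaries above follow directly from the counts in the BUSCO ratings table. A minimal sketch of that arithmetic, assuming percentages are rounded to one decimal place (the function name `busco_summary` is hypothetical):

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Build a BUSCO-style one-line summary from rating counts."""
    n = single + duplicated + fragmented + missing
    complete = single + duplicated  # Complete = Single + Duplicated

    def pct(count):
        return f"{100.0 * count / n:.1f}%"

    return (
        f"C:{pct(complete)}[S:{pct(single)},D:{pct(duplicated)}],"
        f"F:{pct(fragmented)},M:{pct(missing)},n:{n}"
    )


# SGD counts from the BUSCO ratings table.
print(busco_summary(single=1672, duplicated=11, fragmented=12, missing=16))
# C:98.4%[S:97.7%,D:0.6%],F:0.7%,M:0.9%,n:1711
```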

3.3 BUSCO Gene Details

Full BUSCO results with ratings for each gene have been compiled in yeast.busco.tdt.

4 BUSCOMP Ratings

The best complete BUSCO hit results (based on Score and Length) have been compiled in yeast.buscoseq.tdt. The Genome field indicates the assembly with the best hit, which is followed by details of that hit (Contig, Start, End, Score, Length). BUSCOMP ratings for each assembly are then given in subsequent fields:

  • Identical: 100% coverage and 100% identity in at least one contig/scaffold.
  • Complete: 95%+ Coverage in a single contig/scaffold. (Note: accuracy/identity is not considered.)
  • Duplicated: 95%+ Coverage in 2+ contigs/scaffolds.
  • Fragmented: 95%+ combined coverage but not in any single contig/scaffold.
  • Partial: 40-95% combined coverage.
  • Ghost: Hits meeting local cutoff but <40% combined coverage.
  • Missing: No hits meeting local cutoff.
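
The rating rules above can be read as a simple decision cascade. The sketch below is one possible reading, not BUSCOMP's implementation: the `Hit` tuple, the function name, and the assumption that hits have already been filtered by the local cutoff and summarised as per-contig and combined percentage coverage are all hypothetical. Only the 95% and 40% thresholds come from the definitions above.

```python
from collections import namedtuple

# Hypothetical per-contig summary of hits that passed the local cutoff.
Hit = namedtuple("Hit", "contig coverage identity")  # percentages, 0-100


def buscomp_rating(hits, combined_coverage):
    """Return (rating, identical) for one BUSCOMP sequence in one assembly."""
    if not hits:
        return "Missing", False
    # Contigs/scaffolds that individually cover 95%+ of the sequence.
    full_contigs = {h.contig for h in hits if h.coverage >= 95.0}
    # Identical is an extra flag on top of Complete/Duplicated.
    identical = any(h.coverage == 100.0 and h.identity == 100.0 for h in hits)
    if len(full_contigs) >= 2:
        return "Duplicated", identical
    if len(full_contigs) == 1:
        return "Complete", identical
    if combined_coverage >= 95.0:
        return "Fragmented", False
    if combined_coverage >= 40.0:
        return "Partial", False
    return "Ghost", False


print(buscomp_rating([Hit("ctg1", 100.0, 100.0)], 100.0))  # ('Complete', True)
print(buscomp_rating([Hit("ctg1", 60.0, 99.0), Hit("ctg2", 50.0, 99.0)], 97.0))
# ('Fragmented', False)
```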

4.1 BUSCOSeq Rating Summary

BUSCOMP ratings (see above) are compiled to summary statistics in yeast.N3L20ID0U.ratings.tdt. Note that Identical ratings in this table will also be rated as Complete, which in turn are Single or Duplicated. Percentage summaries are plotted below, along with a BUSCO-style one-line summary per assembly/group.

NOTE: Group summaries do not include Identical ratings.

X. Genome N Identical Complete Single Duplicated Fragmented Partial Ghost Missing
1 SGD 1681 1609 1679 1678 1 0 2 0 0
2 PacBioHQ 1681 1611 1679 1678 1 0 2 0 0
3 chrIIIdup 1681 43 45 0 45 0 1 1 1634
4 HighQuality 1681 0 1679 1678 1 0 2 0 0
5 PacBioWTDBG2 1681 1018 1679 1677 2 0 2 0 0
6 BUSCOMP 1681 0 1679 1678 1 0 2 0 0

BUSCOMP BUSCOMP Results [1681 (98.25%) Complete BUSCOs; 0 (0.00%) BUSCOMP Seqs]:
        C:99.9%[S:99.8%,D:0.1%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681

HighQuality BUSCOMP Results [1673 (97.78%) Complete BUSCOs; 0 (0.00%) BUSCOMP Seqs]:
        C:99.9%[S:99.8%,D:0.1%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681

PacBioHQ BUSCOMP Results [1673 (97.78%) Complete BUSCOs; 1599 (95.12%) BUSCOMP Seqs]:
        C:99.9%[S:99.8%,D:0.1%,I:95.8%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681

PacBioWTDBG2 BUSCOMP Results [1375 (80.36%) Complete BUSCOs; 77 (4.58%) BUSCOMP Seqs]:
        C:99.9%[S:99.8%,D:0.1%,I:60.6%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681

SGD BUSCOMP Results [1672 (97.72%) Complete BUSCOs; 5 (0.30%) BUSCOMP Seqs]:
        C:99.9%[S:99.8%,D:0.1%,I:95.7%],F:0.0%,P:0.1%,G:0.0%,M:0.0%,n:1681

chrIIIdup BUSCOMP Results [0 (0.00%) Complete BUSCOs; 0 (0.00%) BUSCOMP Seqs]:
        C:2.7%[S:0.0%,D:2.7%,I:2.6%],F:0.0%,P:0.1%,G:0.1%,M:97.2%,n:1681

4.2 BUSCOSeq Full Results Table

Full BUSCOMP results with ratings for each gene in every assembly and group have been compiled in yeast.N3L20ID0U.buscomp.tdt.

5 BUSCO and BUSCOMP Comparisons

5.1 BUSCO to BUSCOMP Rating Changes

Ratings changes from BUSCO to BUSCOMP (where NULL ratings indicate no BUSCOMP sequence):

BUSCO BUSCOMP SGD PacBioHQ chrIIIdup PacBioWTDBG2 TOTAL
Complete Complete 1670 1671 0 1372 4713
Complete Duplicated 0 0 0 1 1
Complete Partial 2 2 0 2 6
Duplicated Complete 2 2 0 1 5
Duplicated Duplicated 1 1 45 0 47
Duplicated NULL 8 8 0 8 24
Fragmented Complete 3 1 0 126 130
Fragmented Duplicated 0 0 0 1 1
Fragmented NULL 9 9 0 8 26
Missing Complete 3 4 0 178 185
Missing Ghost 0 0 1 0 1
Missing Missing 0 0 1634 0 1634
Missing NULL 13 13 30 14 70
Missing Partial 0 0 1 0 1
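
A table of this kind is a cross-tabulation of per-gene rating pairs. As a minimal sketch (the function name and toy gene identifiers are hypothetical, and genes without a BUSCOMP sequence are assumed to be labelled NULL as in the table), the counts for one assembly could be produced as follows:

```python
from collections import Counter


def rating_changes(busco, buscomp):
    """Count (BUSCO rating, BUSCOMP rating) pairs for one assembly.

    busco and buscomp map gene IDs to ratings; genes with no BUSCOMP
    sequence are counted under "NULL".
    """
    return Counter(
        (rating, buscomp.get(gene, "NULL")) for gene, rating in busco.items()
    )


# Hypothetical ratings for four genes.
busco = {"g1": "Complete", "g2": "Missing", "g3": "Fragmented", "g4": "Missing"}
buscomp = {"g1": "Complete", "g2": "Complete", "g3": "Complete"}
for (old, new), count in sorted(rating_changes(busco, buscomp).items()):
    print(old, new, count)
```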

Full table of ratings changes by gene:

Complete, Duplicated, Fragmented, Partial, Ghost, Missing, NULL (no BUSCOMP sequence)

5.1.1 BUSCOMP Gain test

There is a risk that performing a low stringency search will identify homologues or pseudogenes of the desired BUSCO gene in error. If there is a second copy of a gene in the genome that is detectable by the search then we would expect the same genes that go from Missing to Complete in some genomes to go from Single to Duplicated in others.

To test this, data is reduced for each pair of genomes to BUSCO-BUSCOMP rating pairs of:

  • Single-Single
  • Single-Duplicated
  • Missing-Missing
  • Missing-Single

This is then converted into Gain ratings (Single-Duplicated & Missing-Single) or No Gain ratings (Single-Single & Missing-Missing). The Single-Duplicated shift in one genome is then used to set the expected Missing-Single shift in the other, and assess the probability of observing the Missing-Single shift using a cumulative binomial distribution, where:

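  • k is the number of observed GG pairs (Single-Duplicated in the background genome and Missing-Single in the focal genome), where G = Gain, N = No Gain, and each two-letter code gives the background genome rating followed by the focal genome rating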
  • n is the number of Missing-Single Gains in the focal genome (NG+GG)
  • p is the proportion of Single-Duplicated Gains in the background genome ((GN+GG) / (GN+GG+NN+NG))
  • pB is the probability of observing k+ Missing-Single gains, given p and n

This is output to *.gain.tdt, where each row is a Genome and each field gives the probability of the row genome’s Missing-Single gains, given the column genome’s Single-Duplicated gains:

Genome SGD PacBioHQ chrIIIdup PacBioWTDBG2
PacBioHQ 1 1 1 1
PacBioWTDBG2 1 1 1 1
SGD 1 1 1 1
chrIIIdup 1 1 1 1

Low probabilities indicate that BUSCOMP might be rating paralogues or pseudogenes and not functional orthologues of the BUSCO gene. Note that there is no correction for multiple testing, nor any adjustment for lack of independence between samples.
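
The probability pB described above is the upper tail of a binomial distribution. A minimal sketch of that calculation, reading p as (GN+GG) / (GN+GG+NN+NG) and n as NG+GG, is given below. The function names and the toy counts are hypothetical, and returning 1 when there are no usable pairs is an assumption rather than documented BUSCOMP behaviour.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Probability of observing k or more successes in n trials."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def gain_probability(gg, gn, ng, nn):
    """pB for one focal/background genome pair from Gain/No Gain pair counts.

    Two-letter codes are read as background rating then focal rating.
    """
    total = gn + gg + nn + ng
    if total == 0:
        return 1.0  # assumption: nothing to test
    p = (gn + gg) / total   # background Single-Duplicated gain rate
    n = ng + gg             # Missing-Single gains in the focal genome
    k = gg                  # gains shared by both genomes
    return binomial_upper_tail(k, n, p)


# Hypothetical counts: p = 0.02, n = 2, k = 1.
print(round(gain_probability(gg=1, gn=1, ng=1, nn=97), 4))  # 0.0396
```

Note that whenever k is 0 (no shared gains) the probability is 1 by definition.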

5.2 Unique BUSCO and BUSCOMP Complete Genes

BUSCO and BUSCOMP Complete ratings were compared for each BUSCO gene to identify those genes unique to either a single assembly or a group of assemblies. The BUSCOMP group is excluded from this analysis, as (typically) are other redundant groups wholly contained within another group. (Inclusion of such groups is guaranteed to result in 2+ groups containing any Complete BUSCOs they have.)

SGD unique Complete genes: 0 BUSCO; 0 BUSCOMP
PacBioHQ unique Complete genes: 1 BUSCO; 0 BUSCOMP
chrIIIdup unique Complete genes: 0 BUSCO; 0 BUSCOMP
PacBioWTDBG2 unique Complete genes: 5 BUSCO; 0 BUSCOMP
HighQuality unique Complete genes: 304 BUSCO; 0 BUSCOMP
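
For individual assemblies, the unique Complete genes amount to a set difference against every other assembly. The sketch below shows only that simple case; how BUSCOMP handles groups and their member assemblies is not reproduced here, and the function name and toy gene identifiers are hypothetical.

```python
def unique_complete(complete_sets):
    """Map each assembly to the Complete genes found in no other assembly.

    complete_sets maps an assembly name to the set of gene IDs rated Complete.
    """
    unique = {}
    for name, genes in complete_sets.items():
        others = set().union(
            *(g for other, g in complete_sets.items() if other != name)
        )
        unique[name] = genes - others
    return unique


# Hypothetical Complete gene sets for three assemblies.
complete_sets = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3"},
    "C": {"g3", "g4"},
}
for name, genes in unique_complete(complete_sets).items():
    print(name, sorted(genes))
```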

5.3 Ratings for Missing BUSCO genes

In addition to the unique ratings (above), it can be useful to know how genes Missing from one assembly/group are rated in the others. These plots are generated for each assembly/group in turn. The full BUSCO (*.busco.tdt) and BUSCOMP (*.LnnIDxx.buscomp.tdt) tables are reduced to the subset of genes that are missing in the assembly/group of interest, and then the summary ratings recalculated for that subset.

In each case, three plots are made (assuming both BUSCO and BUSCOMP data is available):

  1. BUSCO ratings for missing BUSCO genes.
  2. BUSCOMP ratings for missing BUSCO genes. As well as being more relaxed than pure BUSCO results, this will indicate when BUSCOMP has found a gene in the focal assembly/group where BUSCO did not.
  3. BUSCOMP ratings for missing BUSCOMP genes. It is expected that assemblies will be much more similar in terms of BUSCOMP coverage.

5.4 Missing SGD BUSCO genes

BUSCO ratings for Missing SGD BUSCO genes:

BUSCOMP ratings for Missing SGD BUSCO genes:

BUSCOMP ratings for Missing SGD BUSCOMP genes:

5.5 Missing PacBioHQ BUSCO genes

BUSCO ratings for Missing PacBioHQ BUSCO genes:

BUSCOMP ratings for Missing PacBioHQ BUSCO genes:

BUSCOMP ratings for Missing PacBioHQ BUSCOMP genes:

5.6 Missing chrIIIdup BUSCO genes

BUSCO ratings for Missing chrIIIdup BUSCO genes:

BUSCOMP ratings for Missing chrIIIdup BUSCO genes:

BUSCOMP ratings for Missing chrIIIdup BUSCOMP genes:

5.7 Missing HighQuality BUSCO genes

BUSCO ratings for Missing HighQuality BUSCO genes:

BUSCOMP ratings for Missing HighQuality BUSCO genes:

BUSCOMP ratings for Missing HighQuality BUSCOMP genes:

5.8 Missing PacBioWTDBG2 BUSCO genes

BUSCO ratings for Missing PacBioWTDBG2 BUSCO genes:

BUSCOMP ratings for Missing PacBioWTDBG2 BUSCO genes:

BUSCOMP ratings for Missing PacBioWTDBG2 BUSCOMP genes:

5.9 Missing BUSCOMP BUSCO genes

BUSCO ratings for Missing BUSCOMP BUSCO genes:

BUSCOMP ratings for Missing BUSCOMP BUSCO genes:

BUSCOMP ratings for Missing BUSCOMP BUSCOMP genes:

6 Appendix: BUSCOMP run details

BUSCOMP V0.7.1: run Thu May  9 10:13:22 2019

This analysis was run in:

/Users/redwards/code/buscomp/example/run
  • Log file: /Users/redwards/code/buscomp/example/run/yeast.log
  • Commandline arguments: ini=../run/example.ini i=-1 forks=4
  • Full Command List: minimap2=/Users/redwards/Data/BiowareOSX/minimap2/minimap2 ini genomesize=13.1e6 genomes=../example.genomes.csv groups=../example.groups.csv runs=../busco3/run_* fastadir=../fasta/ basefile=yeast i=-1 forks=4

6.1 BUSCOMP errors

BUSCOMP returned no runtime errors.

6.2 BUSCOMP warnings

See run log for further details:

#WARN   00:00:04    "Single copy" BUSCO EOG092E01WO has 2+ sequences in ../busco3/run_MBG344001/single_copy_busco_sequences/EOG092E01WO.fna! (Keeping first.)
#WARN   00:00:04    "Single copy" BUSCO EOG092E01WO has 2+ sequences in ../busco3/run_MBG344001/single_copy_busco_sequences/EOG092E01WO.faa! (Keeping first.)
#WARN   00:00:06    "Single copy" BUSCO EOG092E0EIP has 2+ sequences in ../busco3/run_MBG344WTDBG2/single_copy_busco_sequences/EOG092E0EIP.fna! (Keeping first.)
#WARN   00:00:06    "Single copy" BUSCO EOG092E0EIP has 2+ sequences in ../busco3/run_MBG344WTDBG2/single_copy_busco_sequences/EOG092E0EIP.faa! (Keeping first.)

Output generated by BUSCOMP v0.7.1 © 2019 Richard Edwards | richard.edwards@unsw.edu.au